In [13]:
%matplotlib inline
You can find this notebook at bit.ly/pydsmmeetup.
This presentation will guide you through a short introduction to working with distributional semantic models (DSM) using the Python package PyDSM.
To successfully install the package, you first need its scientific Python dependencies installed locally. I recommend using the scientific Python distribution Anaconda to ease the process.
When you have installed the requirements, this should do the trick:
git clone https://github.com/jimmycallin/pydsm
cd pydsm
python setup.py install
Make sure everything is installed correctly by opening your favorite REPL (which should be IPython) and importing the library.
In [2]:
import pydsm
PyDSM is a small Python library made for exploratory analysis of distributional semantic models. So far, two common models are available: CooccurrenceDSM and RandomIndexing. RandomIndexing is an implementation of the model we use at Gavagai. For a detailed explanation, I recommend reading this introduction.
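To give a feel for what a co-occurrence model counts under the hood, here is a minimal sketch of symmetric window-based counting in plain Python. The tokenization and window handling are simplified assumptions for illustration, not PyDSM's actual implementation:
from collections import Counter, defaultdict

def count_cooccurrences(tokens, window_size=(2, 2)):
    """Toy stand-in for co-occurrence counting: count how often each word
    co-occurs with every other word within a (left, right) window."""
    left, right = window_size
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts[word].update(context)
    return counts

tokens = "never gonna give you up never gonna let you down".split()
count_cooccurrences(tokens)['you']
# 'gonna' co-occurs twice with 'you' in this toy corpus, the other context words once.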
In [4]:
dsm = pydsm.build(model=pydsm.CooccurrenceDSM,
                  corpus='ukwac.100k.clean.txt',
                  window_size=(2,2),
                  min_frequency=50)
We use a symmetric context window of two words to the left and two to the right, and remove all words with a total occurrence frequency below 50 to keep things clean.
In [28]:
# Sort columns and display some word vectors
dsm[['never', 'gonna', 'give', 'you', 'up'],['never', 'gonna', 'let', 'you', 'down']]
Out[28]:
In [4]:
ppmi = dsm.apply_weighting(pydsm.weighting.ppmi)
In [5]:
ppmi.nearest_neighbors('blue')
Out[5]:
In [6]:
car_brand = dsm.compose('car', 'brand', comp_func=pydsm.composition.multiplicative)
car_brand
Out[6]:
In a perfect world, its nearest neighbors would be actual car brands, rather than a mixture of each word's neighbors. Still, simple addition of distributional vectors has been shown to perform best in many machine learning tasks when the composed vector is used as input.
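To make the difference concrete, here is a toy sketch of additive versus multiplicative composition on plain NumPy vectors. The vectors and the cosine helper are made up for the example; PyDSM's composition functions operate on its own matrix type:
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy context vectors; the dimensions could stand for contexts like "engine", "logo", "market".
car   = np.array([4.0, 1.0, 1.0])
brand = np.array([0.0, 3.0, 3.0])

additive       = car + brand   # keeps contributions from both words
multiplicative = car * brand   # keeps only contexts shared by both words

print(additive, multiplicative)
print(cosine_similarity(additive, car), cosine_similarity(multiplicative, car))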
In [7]:
ppmi.nearest_neighbors(car_brand)
Out[7]:
We will evaluate the quality of our DSM with a commonly used synonym test, TOEFL. The test contains 80 questions of the form "which of these candidates is the best synonym for the given word?", for example:
hasten: accelerate, accompany, determine, permit
tranquillity: harshness, peacefulness, weariness, happiness
make: print, trade, borrow, earn
This corresponds to measuring the distance from the problem word to each candidate and picking the candidate that is closest.
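As a rough sketch of how a single item can be answered (with made-up vectors and a hand-rolled cosine helper, not PyDSM's own evaluation code):
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def answer_toefl_item(problem_word, candidates, vectors):
    """Pick the candidate whose vector is closest (by cosine) to the problem word."""
    return max(candidates, key=lambda c: cosine_similarity(vectors[problem_word], vectors[c]))

# Toy vectors; in practice these would be rows of the DSM matrix.
vectors = {'hasten':     np.array([0.9, 0.1, 0.2]),
           'accelerate': np.array([0.8, 0.2, 0.1]),
           'accompany':  np.array([0.1, 0.9, 0.3]),
           'determine':  np.array([0.2, 0.3, 0.9]),
           'permit':     np.array([0.3, 0.8, 0.4])}

answer_toefl_item('hasten', ['accelerate', 'accompany', 'determine', 'permit'], vectors)
# -> 'accelerate'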
In [8]:
dsm.evaluate(pydsm.evaluation.toefl)
Out[8]:
That's not even close to 100%, and there are several reasons for this. Most notably, simple co-occurrence counting leads to distributional vectors skewed towards ubiquitous words such as function words. We need a weighting function that decreases the importance of commonly occurring words and increases the importance of words that occur in specific contexts. This is what Positive Pointwise Mutual Information (PPMI) does.
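For reference, here is a rough NumPy sketch of the PPMI weighting on a small dense count matrix. PyDSM's pydsm.weighting.ppmi operates on its own sparse matrix type and may differ in implementation details:
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information for a co-occurrence count matrix:
    PPMI(w, c) = max(0, log(P(w, c) / (P(w) * P(c))))."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide='ignore'):   # log(0) -> -inf, clipped to 0 below
        pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)

counts = np.array([[10.0, 0.0, 2.0],
                   [ 1.0, 5.0, 1.0],
                   [ 3.0, 1.0, 8.0]])
ppmi(counts)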
In [15]:
dsm.visualize(vis_func=pydsm.visualization.heatmap)
In [14]:
ppmi = dsm.apply_weighting(pydsm.weighting.ppmi)
ppmi.visualize(vis_func=pydsm.visualization.heatmap)
Visualizing them as heatmaps, we can see how the vector values are more evenly distributed after applying PPMI.
See Bullinaria & Levy "Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study" (2007) for details.
In [11]:
ppmi.evaluate(pydsm.evaluation.toefl)
Out[11]:
While 82.5% is a significant improvement, we still have a long way to go. B&L (2012) [1] managed to reach 100% on the TOEFL test. We will use some of their techniques to improve our own score, but since our corpus is about a fiftieth of the size of theirs, we won't see quite the same gains. Still, the technique is interesting.
Singular Value Decomposition (SVD) has long been known to improve the quality of distributional models, and it is part of the LSA model. Traditionally, the first few hundred or thousand principal components are kept, based on the belief that the remaining ones mostly contain noise. This does indeed increase performance. What B&L found was that by also removing the first 100-200 principal components, the results improve even further.
[1] Bullinaria, J.A. & Levy, J.P. Extracting Semantic Representations from Word Co-occurrence Statistics: Stop-lists, Stemming and SVD (2012)
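Before running this on the full model below, the scheme looks roughly like this on a toy NumPy matrix. The matrix and the cut-off values here are made up for illustration; the real run uses PyDSM's own svd method:
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 30))                  # stand-in for a PPMI matrix

# Full SVD; NumPy returns singular values in descending order.
u, s, v_t = np.linalg.svd(X, full_matrices=False)

k, drop = 10, 2                           # keep the top k components, then drop the first few
reduced = (u[:, :k] * s[:k])[:, drop:]    # rows are the new word vectors
reduced.shape                             # -> (20, 8)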
In [14]:
# Sort the matrix columns according to the collapsed column sum in descending order.
# This is basically equivalent to sort according to word frequency.
dsm.matrix = dsm.matrix.sort(key='sum', axis=1, ascending=False).sort(key='sum', axis=0, ascending=False)
# Pick out the 50k first columns to ease the SVD.
dsm.matrix = dsm.matrix[:50000,:50000]
# Perform PPMI on this again
ppmi = dsm.apply_weighting(pydsm.weighting.ppmi)
In [7]:
# Perform SVD picking out the 3000 largest principal components. This is slow (hours) and requires a lot of memory.
u, s, v_t = ppmi.matrix.svd(k=3000)
dim_reduced = u.multiply(s)
dim_reduced
Out[7]:
Because of the nature of the SVD factorization (both u and v_t are orthogonal, so multiplying by v_t is nothing more than a rotation of the vector space) and the fact that our distance measure is the cosine distance, we can use u * s directly and discard v_t. See the paper for details.
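A quick way to convince yourself of the rotation argument: with a full-rank SVD, the cosine similarities computed on u * s are identical to those computed on the original matrix. A toy NumPy check, not part of PyDSM:
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
X = rng.random((5, 8))

u, s, v_t = np.linalg.svd(X, full_matrices=False)
us = u * s                                # rows of U*S

# Same pairwise similarity whether we use the original rows or the U*S rows.
np.isclose(cosine_similarity(X[0], X[1]), cosine_similarity(us[0], us[1]))
# -> True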
In [10]:
pydsm.evaluation.toefl(dim_reduced)
Out[10]:
Running the evaluation on the SVD-reduced matrix alone actually worsens the result compared to the original PPMI matrix. This is expected (see the paper for why). But what happens next, when we remove the largest principal components, is interesting.
In [16]:
# Ignore the first 200 principal components, i.e. the directions of largest variance
pydsm.evaluation.toefl(dim_reduced[:,200:])
Out[16]:
Why removing the components that capture the largest variance improves the result is not entirely clear. When B&L added extra random noise to the model, they showed that removing the first principal components also removes most of the added noise. The earlier assumption that noise is mainly gathered in the smallest principal components is simply not true.
And this is as far as we get. Because of some parameter choices made to ease the computation, and because of our smaller corpus, we don't quite reach the same score as B&L. Feel free to try to get all the way to the top. And if you do, I would be very interested to hear about it!